Exploratory Data Analysis of complementary datasets

This notebook performs an initial exploratory data analysis (EDA) of several datasets that are candidates to complement the final dataset used in the project. The datasets analyzed are described in the sections below.

Library imports and initial setup

In [1]:
from pathlib import Path
import os
import pandas as pd
from sklearn.model_selection import train_test_split

import imagesize

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import json

from waste_detection_system import shared_data as base, utils, dataset_creator
In [2]:
# plot style
# ==============================================================================
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['figure.titlesize'] = 16

TACO (Trash Annotations in COntext)

This dataset was captured in a less-than-ideal context (images taken with a variety of cameras, mostly outdoors), with one or more annotations per image. The categories represented in this dataset are:

  • Other: 48.93%
  • Plastic: 39.82%
  • Paper: 5.89%
  • Glass: 5.18%
  • Organic: 0.16%

The annotations are in COCO format.
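COCO annotations live in a single JSON file with `images`, `annotations`, and `categories` arrays linked by ids, and boxes stored as `[x, y, width, height]` in pixels. A minimal sketch of the structure the loading cell below relies on (the field names are the standard COCO ones; the concrete values are invented for illustration):

```python
# Minimal COCO-style structure (illustrative values, standard field names)
coco = {
    "images": [{"id": 0, "file_name": "batch_1/000128.JPG",
                "width": 2448, "height": 3264}],
    "annotations": [{"id": 0, "image_id": 0, "category_id": 5,
                     "bbox": [1850.0, 1007.0, 598.0, 531.0]}],  # [x, y, w, h] in pixels
    "categories": [{"id": 5, "name": "Plastic bottle"}],
}

# Each annotation links back to its image and category through the ids:
ann = coco["annotations"][0]
image = next(i for i in coco["images"] if i["id"] == ann["image_id"])
cat = next(c for c in coco["categories"] if c["id"] == ann["category_id"])
print(image["file_name"], cat["name"], ann["bbox"])
```

The loading cell below does exactly this id join, once per annotation, to flatten the JSON into one row per annotation.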

In [3]:
images = {
    'name' : [],
    'path' : [],
    'width' : [],
    'height' : [],
    'label' : [],
    'bbox-x' : [],
    'bbox-y' : [],
    'bbox-w' : [],
    'bbox-h' : [],
}

json_file = {}
with open(base.TACO / 'annotations.json', 'r') as _file:
  json_file = json.load(_file)

for image in json_file['images']:
  anns = [item for item in json_file['annotations']
          if item['image_id'] == image['id']]

  for ann in anns:
    cat = [item for item in json_file['categories']
           if item['id'] == ann['category_id']][0]
    images['name'].append(image['file_name'])
    images['path'].append(str(base.TACO / image['file_name']))
    images['width'].append(image['width'])
    images['height'].append(image['height'])
    images['label'].append(cat['name'])
    bbox = ann['bbox']
    images['bbox-x'].append(bbox[0])
    images['bbox-y'].append(bbox[1])
    images['bbox-w'].append(bbox[2])
    images['bbox-h'].append(bbox[3])

images_df = pd.DataFrame(images)
images_df['label'] = [base.RELATION_CATS[label.upper()] for label in images_df['label']]

train, test = train_test_split(images_df, test_size=0.2, 
                               stratify=images_df[['label']])
train, val = train_test_split(train, test_size=0.15,
                              stratify=train[['label']])
train['type'] = 'train'
val['type'] = 'val'
test['type'] = 'test'

images_df = pd.concat([train, val, test])
In [4]:
images_df.head(n=10)
Out[4]:
name path width height label bbox-x bbox-y bbox-w bbox-h type
294 batch_1/000128.JPG raw-datasets\TACO\batch_1\000128.JPG 2448 3264 PLASTICO 1850.0 1007.0 598.0 531.0 train
2448 batch_2/000049.JPG raw-datasets\TACO\batch_2\000049.JPG 3163 2448 PLASTICO 3011.0 974.0 79.0 45.0 train
3545 batch_6/000035.JPG raw-datasets\TACO\batch_6\000035.JPG 2448 3264 PLASTICO 1221.0 1735.0 359.0 361.0 train
3870 batch_7/000078.JPG raw-datasets\TACO\batch_7\000078.JPG 2448 3264 PLASTICO 984.0 1885.0 241.0 164.0 train
134 batch_1/000068.JPG raw-datasets\TACO\batch_1\000068.JPG 2448 3264 OTROS 170.0 1508.0 86.0 154.0 train
3220 batch_5/000058.JPG raw-datasets\TACO\batch_5\000058.JPG 2448 3264 OTROS 1552.0 210.0 10.0 14.0 train
2728 batch_3/IMG_4948.JPG raw-datasets\TACO\batch_3\IMG_4948.JPG 2448 2470 PLASTICO 1403.0 1525.0 50.0 30.0 train
2082 batch_15/000038.jpg raw-datasets\TACO\batch_15\000038.jpg 3024 4032 PLASTICO 1103.0 2533.0 34.0 189.0 train
2212 batch_2/000007.JPG raw-datasets\TACO\batch_2\000007.JPG 2448 3264 PLASTICO 1197.0 1672.0 193.0 275.0 train
342 batch_10/000003.jpg raw-datasets\TACO\batch_10\000003.jpg 4000 1824 OTROS 211.0 203.0 197.0 108.0 train
In [5]:
len(images_df.index)
Out[5]:
4784
In [6]:
images_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4784 entries, 294 to 4014
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    4784 non-null   object 
 1   path    4784 non-null   object 
 2   width   4784 non-null   int64  
 3   height  4784 non-null   int64  
 4   label   4784 non-null   object 
 5   bbox-x  4784 non-null   float64
 6   bbox-y  4784 non-null   float64
 7   bbox-w  4784 non-null   float64
 8   bbox-h  4784 non-null   float64
 9   type    4784 non-null   object 
dtypes: float64(4), int64(2), object(4)
memory usage: 411.1+ KB
In [7]:
images_df['type'].value_counts()
Out[7]:
train    3252
test      957
val       575
Name: type, dtype: int64
In [8]:
images_df['type'].value_counts(normalize=True)
Out[8]:
train    0.679766
test     0.200042
val      0.120192
Name: type, dtype: float64
In [9]:
images_df['label'].value_counts(normalize=True)
Out[9]:
OTROS       0.489339
PLASTICO    0.398202
PAPEL       0.058946
VIDRIO      0.051839
ORGANICO    0.001672
Name: label, dtype: float64
In [10]:
sample_imgs = images_df[(images_df.type == 'train')].sample(n=3)
utils.plot_data_sample(sample_imgs, images_df)
In [11]:
with open(base.TACO_CSV, 'w', encoding='utf-8-sig') as f:
  images_df.to_csv(f, index=False)

Drinking Waste

This dataset consists of images of glass bottles, Tetra Paks, plastic bottles and aluminium cans on a plain background (a less-than-ideal context). Most images contain a single object, although a few contain several. Note also that this dataset is augmented: the same image appears repeatedly under different rotations and mirrorings.

  • Plastic: 74.68%
  • Glass: 25.31%

The annotations are stored in individual TXT files in YOLO format.
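Each YOLO TXT line holds `class x_center y_center width height`, with the four box values normalized to [0, 1] relative to the image size, so converting to COCO pixel coordinates needs the image dimensions. A sketch of what `utils.yolo2coco` presumably computes (this is a hypothetical re-implementation for illustration, not the project's actual code):

```python
def yolo2coco(x_c, y_c, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height in [0, 1])
    to a COCO box (top-left x/y, width, height in pixels).
    Hypothetical re-implementation of utils.yolo2coco for illustration."""
    bb_w = w * img_w
    bb_h = h * img_h
    bb_x = x_c * img_w - bb_w / 2  # shift from center to top-left corner
    bb_y = y_c * img_h - bb_h / 2
    return bb_x, bb_y, bb_w, bb_h

# A centered box covering half the image in each dimension, in a 512x384 image:
print(yolo2coco(0.5, 0.5, 0.5, 0.5, 512, 384))  # (128.0, 96.0, 256.0, 192.0)
```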

In [12]:
img_files = []
txt_files = []
img_txt = {}

for dirpath, dirs, filenames in os.walk(base.DRINKING_WASTE):
  txt_files = txt_files + [Path(dirpath)/filename for filename 
                           in filenames if filename.endswith('.txt')]
  img_files = img_files + [Path(dirpath)/filename for filename 
                           in filenames if not filename.endswith('.txt')]

for txt_f in txt_files:
  associated_img = [img for img in img_files if txt_f.stem == img.stem][0]
  img_txt[associated_img] = txt_f
In [13]:
unannotated = [img for img in img_files if img not in img_txt]
print(f'Images without an annotation file: {len(unannotated)}')
print(*unannotated, sep='\n')
Images without an annotation file: 18
raw-datasets\drinking-waste\Glass1,081.heic
raw-datasets\drinking-waste\Glass1,082.HEIC
raw-datasets\drinking-waste\Glass1,083.heic
raw-datasets\drinking-waste\Glass1,084.HEIC
raw-datasets\drinking-waste\Glass1,085.heic
raw-datasets\drinking-waste\Glass1,086.HEIC
raw-datasets\drinking-waste\Glass1,087.heic
raw-datasets\drinking-waste\Glass1,088.HEIC
raw-datasets\drinking-waste\Glass664.jpg
raw-datasets\drinking-waste\Glass665.JPG
raw-datasets\drinking-waste\Glass666.jpg
raw-datasets\drinking-waste\Glass667.JPG
raw-datasets\drinking-waste\Glass668.jpg
raw-datasets\drinking-waste\Glass669.JPG
raw-datasets\drinking-waste\Glass670.jpg
raw-datasets\drinking-waste\Glass671.JPG
raw-datasets\drinking-waste\Glass672.jpg
raw-datasets\drinking-waste\Glass673.JPG
In [14]:
images = {
    'name' : [],
    'path' : [],
    'width' : [],
    'height' : [],
    'label' : [],
    'bbox-x' : [],
    'bbox-y' : [],
    'bbox-w' : [],
    'bbox-h' : [],
}

for img, txt in img_txt.items():
  w, h = imagesize.get(str(img))
  df = pd.read_csv(str(txt), delim_whitespace=True, header=None,
                   names=['label', 'x', 'y', 'width', 'height'])
  bb_x, bb_y, bb_w, bb_h = utils.yolo2coco(df.x.iloc[0],
    df.y.iloc[0], df.width.iloc[0], df.height.iloc[0], w, h)

  images['name'].append(img.name)
  images['path'].append(str(img))
  images['width'].append(w)
  images['height'].append(h)
  images['label'].append(base.RELATION_CATS[str(df.label.iloc[0])])
  images['bbox-x'].append(bb_x)
  images['bbox-y'].append(bb_y)
  images['bbox-w'].append(bb_w)
  images['bbox-h'].append(bb_h)
In [15]:
images_df = pd.DataFrame(images)

train, test = train_test_split(images_df, test_size=0.2, 
                               stratify=images_df[['label']])
train, val = train_test_split(train, test_size=0.15,
                              stratify=train[['label']])
train['type'] = 'train'
val['type'] = 'val'
test['type'] = 'test'

images_df = pd.concat([train, val, test])
In [16]:
images_df.head(n=10)
Out[16]:
name path width height label bbox-x bbox-y bbox-w bbox-h type
3500 PET1,178.jpg raw-datasets\drinking-waste\PET1,178.jpg 512 683 PLASTICO 203 120 100 61 train
4261 PET431.jpg raw-datasets\drinking-waste\PET431.jpg 384 512 PLASTICO 143 97 114 256 train
2831 HDPEM578.jpg raw-datasets\drinking-waste\HDPEM578.jpg 512 683 PLASTICO 148 297 184 238 train
674 AluCan652.jpg raw-datasets\drinking-waste\AluCan652.jpg 512 683 PLASTICO 289 232 133 117 train
1867 Glass620.jpg raw-datasets\drinking-waste\Glass620.jpg 512 683 VIDRIO 97 218 346 170 train
4705 PET883.jpg raw-datasets\drinking-waste\PET883.jpg 512 683 PLASTICO 180 350 107 41 train
4378 PET548.jpg raw-datasets\drinking-waste\PET548.jpg 512 683 PLASTICO 343 387 104 125 train
331 AluCan343.jpg raw-datasets\drinking-waste\AluCan343.jpg 512 683 PLASTICO 270 446 160 130 train
593 AluCan58.jpg raw-datasets\drinking-waste\AluCan58.jpg 512 384 PLASTICO 149 126 180 199 train
1427 Glass156.jpg raw-datasets\drinking-waste\Glass156.jpg 512 384 VIDRIO 0 7 511 376 train
In [17]:
len(images_df.index)
Out[17]:
4811
In [18]:
images_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4811 entries, 3500 to 954
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4811 non-null   object
 1   path    4811 non-null   object
 2   width   4811 non-null   int64 
 3   height  4811 non-null   int64 
 4   label   4811 non-null   object
 5   bbox-x  4811 non-null   int64 
 6   bbox-y  4811 non-null   int64 
 7   bbox-w  4811 non-null   int64 
 8   bbox-h  4811 non-null   int64 
 9   type    4811 non-null   object
dtypes: int64(6), object(4)
memory usage: 413.4+ KB
In [19]:
images_df['type'].value_counts()
Out[19]:
train    3270
test      963
val       578
Name: type, dtype: int64
In [20]:
images_df['type'].value_counts(normalize=True)
Out[20]:
train    0.679692
test     0.200166
val      0.120141
Name: type, dtype: float64
In [21]:
images_df['label'].value_counts(normalize=True)
Out[21]:
PLASTICO    0.74683
VIDRIO      0.25317
Name: label, dtype: float64
In [22]:
sample_imgs = images_df[(images_df.type == 'train')].sample(n=5)
utils.plot_data_sample(sample_imgs, images_df)
In [23]:
with open(base.DRINKING_WASTE_CSV, 'w', encoding='utf-8-sig') as f:
  images_df.to_csv(f, index=False)

Cig Butts

This dataset consists of artificially generated, in-context images of cigarette butts. There is only one object per image.

  • Other: 100%

The annotations are in COCO format.

In [24]:
jsons = {}

for dirpath, dirs, filenames in os.walk(base.CIG_BUTTS):
  full_path = [str(Path(dirpath)/filename) for filename in filenames
               if filename.endswith('.json')]
  for _path in full_path:
    with open(_path, 'r') as _file:
      jsons[dirpath] = json.load(_file)
In [25]:
partitions = {'test', 'train', 'val'}
images = {
    'name' : [],
    'path' : [],
    'width' : [],
    'height' : [],
    'type' : [],
    'label' : [],
    'bbox-x' : [],
    'bbox-y' : [],
    'bbox-w' : [],
    'bbox-h' : [],
}

for _path, json_file in jsons.items():
  partition_type = [part for part in partitions if part in _path][0]
  for image in json_file['images']:
    anns = [item for item in json_file['annotations']
            if item['image_id'] == image['id']]

    for ann in anns:
      cat = [item for item in json_file['categories']
             if item['id'] == ann['category_id']][0]
      images['name'].append(image['file_name'])
      images['path'].append(str(Path(_path)/'images'/image['file_name']))
      images['width'].append(image['width'])
      images['height'].append(image['height'])
      images['type'].append(partition_type)
      images['label'].append(cat['name'])
      bbox = ann['bbox']
      images['bbox-x'].append(bbox[0])
      images['bbox-y'].append(bbox[1])
      images['bbox-w'].append(bbox[2])
      images['bbox-h'].append(bbox[3])

images_df = pd.DataFrame(images)
images_df['label'] = [base.RELATION_CATS[label.upper()] for label in images_df['label']]

train, test = train_test_split(images_df[(images_df.type == 'train')], test_size=0.1)
test['type'] = 'test'

images_df = pd.concat([train, test, images_df[(images_df.type == 'val')]])
In [26]:
images_df.head(n=10)
Out[26]:
name path width height type label bbox-x bbox-y bbox-w bbox-h
1845 00001845.jpg raw-datasets\cig_butts\train\images\00001845.jpg 512 512 train OTROS 59.5 141.0 76.5 48.5
1487 00001487.jpg raw-datasets\cig_butts\train\images\00001487.jpg 512 512 train OTROS 164.5 177.5 91.0 98.0
252 00000252.jpg raw-datasets\cig_butts\train\images\00000252.jpg 512 512 train OTROS 66.5 296.5 46.0 80.0
56 00000056.jpg raw-datasets\cig_butts\train\images\00000056.jpg 512 512 train OTROS 24.5 394.5 77.0 73.0
797 00000797.jpg raw-datasets\cig_butts\train\images\00000797.jpg 512 512 train OTROS 300.5 354.5 68.0 34.0
175 00000175.jpg raw-datasets\cig_butts\train\images\00000175.jpg 512 512 train OTROS 170.5 190.5 77.0 76.0
578 00000578.jpg raw-datasets\cig_butts\train\images\00000578.jpg 512 512 train OTROS 36.5 205.5 121.0 35.0
444 00000444.jpg raw-datasets\cig_butts\train\images\00000444.jpg 512 512 train OTROS 94.5 196.5 148.0 119.0
866 00000866.jpg raw-datasets\cig_butts\train\images\00000866.jpg 512 512 train OTROS 27.5 133.5 88.0 61.0
1102 00001102.jpg raw-datasets\cig_butts\train\images\00001102.jpg 512 512 train OTROS 78.5 77.5 30.0 47.0
In [27]:
len(images_df.index)
Out[27]:
2200
In [28]:
images_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2200 entries, 1845 to 2199
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    2200 non-null   object 
 1   path    2200 non-null   object 
 2   width   2200 non-null   int64  
 3   height  2200 non-null   int64  
 4   type    2200 non-null   object 
 5   label   2200 non-null   object 
 6   bbox-x  2200 non-null   float64
 7   bbox-y  2200 non-null   float64
 8   bbox-w  2200 non-null   float64
 9   bbox-h  2200 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 189.1+ KB
In [29]:
images_df['type'].value_counts()
Out[29]:
train    1800
test      200
val       200
Name: type, dtype: int64
In [30]:
images_df['type'].value_counts(normalize=True)
Out[30]:
train    0.818182
test     0.090909
val      0.090909
Name: type, dtype: float64
In [31]:
images_df['label'].value_counts(normalize=True)
Out[31]:
OTROS    1.0
Name: label, dtype: float64
In [32]:
sample_imgs = images_df[(images_df.type == 'train')].sample(n=3)
utils.plot_data_sample(sample_imgs, images_df)
In [33]:
with open(base.CIG_BUTTS_CSV, 'w', encoding='utf-8-sig') as f:
  images_df.to_csv(f, index=False)

Merging the annotated datasets

The datasets are merged after normalizing their annotations to COCO format. Each row of the resulting dataset is one annotation.

In [44]:
with open(base.CIG_BUTTS_CSV, 'r', encoding='utf-8-sig') as f:
    cigbutts = pd.read_csv(f)
with open(base.DRINKING_WASTE_CSV, 'r', encoding='utf-8-sig') as f:
    drinkingwaste = pd.read_csv(f)
with open(base.TACO_CSV, 'r', encoding='utf-8-sig') as f:
    taco = pd.read_csv(f)

df = pd.concat([cigbutts, drinkingwaste, taco])
In [45]:
df.head(n=5)
Out[45]:
name path width height type label bbox-x bbox-y bbox-w bbox-h
0 00001845.jpg raw-datasets\cig_butts\train\images\00001845.jpg 512 512 train OTROS 59.5 141.0 76.5 48.5
1 00001487.jpg raw-datasets\cig_butts\train\images\00001487.jpg 512 512 train OTROS 164.5 177.5 91.0 98.0
2 00000252.jpg raw-datasets\cig_butts\train\images\00000252.jpg 512 512 train OTROS 66.5 296.5 46.0 80.0
3 00000056.jpg raw-datasets\cig_butts\train\images\00000056.jpg 512 512 train OTROS 24.5 394.5 77.0 73.0
4 00000797.jpg raw-datasets\cig_butts\train\images\00000797.jpg 512 512 train OTROS 300.5 354.5 68.0 34.0
In [46]:
len(df.index)
Out[46]:
11795
In [47]:
df['type'].value_counts()
Out[47]:
train    8322
test     2120
val      1353
Name: type, dtype: int64
In [48]:
df['type'].value_counts(normalize=True)
Out[48]:
train    0.705553
test     0.179737
val      0.114710
Name: type, dtype: float64
In [49]:
df['label'].value_counts(normalize=True)
Out[49]:
PLASTICO    0.466130
OTROS       0.384994
VIDRIO      0.124290
PAPEL       0.023908
ORGANICO    0.000678
Name: label, dtype: float64
In [50]:
df['label'].value_counts()
Out[50]:
PLASTICO    5498
OTROS       4541
VIDRIO      1466
PAPEL        282
ORGANICO       8
Name: label, dtype: int64
In [51]:
df['path'].map(lambda p: Path(p).suffix).value_counts()
Out[51]:
.jpg    9571
.JPG    2024
.png     200
Name: path, dtype: int64
In [52]:
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(15,10), dpi=300, sharey=True)
fig.suptitle('Tamaño de las imágenes del dataset', fontsize=20)
sns.set_theme()

width_plot = axes[0]
height_plot = axes[1]

width_info = df['width'].value_counts(normalize=True, bins=10).reset_index()
width_info['index'] = ['[{:.1f} : {:.1f}]'.format(x.left, x.right) for x in width_info['index']]
height_info = df['height'].value_counts(normalize=True, bins=10).reset_index()
height_info['index'] = ['[{:.1f} : {:.1f}]'.format(x.left, x.right) for x in height_info['index']]

width_plot = sns.barplot(ax=width_plot, data=width_info.reset_index(), x='index', y='width', palette='rocket', 
    hue='width', dodge=False)
for container in width_plot.containers:
    width_plot.bar_label(container, fmt='%.2f')
width_plot.get_legend().remove()
width_plot.set(xlabel='anchura', ylabel='', ylim=(0.0, 1.0))

height_plot = sns.barplot(ax=height_plot, data=height_info.reset_index(), x='index', y='height', palette='rocket', 
    hue='height', dodge=False)
for container in height_plot.containers:
    height_plot.bar_label(container, fmt='%.2f')
height_plot.get_legend().remove()
height_plot.set(xlabel='altura', ylabel='', ylim=(0.0, 1.0))

fig.tight_layout()
plt.show()

The distribution of observations per label is as follows:

                     PAPEL          PLASTICO        VIDRIO      METAL     ORGANICO   OTROS       TOTAL
%                    47.55%         39.02%          3.53%       0.99%     0.02%      8.87%       100%
No. of observations  17,751 + 587   8,633 + 6,416   0 + 1,362   382 + 0   0 + 8      0 + 3,422   38,561

The dataset is not balanced: paper and plastic account for just over 85% of the observations, while there are only 8 observations of organic waste. The dataset also does not represent the real data distribution to be found in the application domain, the recovery of recyclable waste from the residual fraction, since there are municipalities where organic waste (brown bin) is not separated from the residual fraction (grey bin).
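For reference, one common training-time mitigation for this kind of imbalance (as an alternative to re-sampling, and not applied in this project) is inverse-frequency class weighting, in the style of scikit-learn's "balanced" heuristic `n_samples / (n_classes * count)`. A minimal sketch using the merged-dataset counts from the cell above:

```python
# Label counts from the merged complementary datasets (Out[50] above)
counts = {"PLASTICO": 5498, "OTROS": 4541, "VIDRIO": 1466, "PAPEL": 282, "ORGANICO": 8}
total = sum(counts.values())  # 11795 annotations

# Inverse-frequency weights: rare classes get proportionally larger weights
weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(weights)
```

With these counts the 8 ORGANICO observations would receive a weight of roughly 295, which illustrates why weighting alone cannot compensate for a class this scarce: the conclusion below narrows the scope instead.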

Conclusions

The creation of a complete, balanced dataset has been discarded, because the new data is out of context (Garbage In, Garbage Out) and does not solve the class-imbalance problem, since it does not provide enough samples of the minority classes. Likewise, the scope of the study has been narrowed to the detection and classification of paper and plastic (the metal category is discarded for lacking a significant sample).